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Abstract 

Kernel estimation of a probability density function supported on the unit interval has 
proved difficult, because of the well known boundary bias issues a conventional kernel density 
estimator would necessarily face in this situation. Transforming the variable of interest into a 
variable whose density has unconstrained support, estimating that density, and obtaining an 
estimate of the density of the original variable through back-transformation, seems a natural 
idea to easily get rid of the boundary problems. In practice, however, a simple and efficient 
implementation of this methodology is far from immediate, and the few attempts found in 
the literature have been reported not to perform well. In this paper, the main reasons for 
this failure are identified and an easy way to correct them is suggested. It turns out that 
combining the transformation idea with local likelihood density estimation produces viable 
density estimators, mostly free from boundary issues. Their asymptotic properties are derived, 
and a practical cross-validation bandwidth selection rule is devised. Extensive simulations 
demonstrate the excellent performance of these estimators compared to their main competitors 
for a wide range of density shapes. In fact, they turn out to be the best choice overall. Finally, 
they are used to successfully estimate a density of non-standard shape supported on [0, 1] from 
a small-size real data sample. 
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1 Introduction 



Kernel density estimation is a standard nonparametric method for estimating a probability density 



without making any rigid assumption about the distribution of the data, see for instance Wand and 



Jones 


(1995 


) and 


Simonoff 


(1996) 



univariate distribution Fx admitting a density fx, the conventional kernel density estimator is 

' x - Xi 



i=l v 7 



'1.11 



The 'kernel' function K is usually taken to be a symmetric unimodal probability density and the 
'bandwidth' h is the parameter that controls the smoothness of the final estimate. The huge 
amount of literature on kernel density estimation testifies that, provided that the value of h is 



sensibly selected, estimator (1.1 ) is a reliable one for estimating fx in a flexible way. 

It is, however, well known that it is not appropriate when the support of fx is bounded. The reason 
is that the estimator does not 'feel' the support boundaries and, for x close to a boundary, places 
some positive mass outside that support. This results in a significant bias which may prevent the 
estimator from being consistent in those areas. As support restrictiveness is common in practice, for 
instance when X is known to be positive, curing that boundary bias problem has attracted a lot of 



attention in the literature, see, inter alia, Schuster (1985), Miiller (1991), Lejeune and Sarda (1992), 



Jones (1993), Jones and Foster (1996), Cowling and Hall (1996), Cheng et al (1997), Zhang and 



Karunamuni (1998, 2000), Zhang et al (1999), Chen (2000), Hall and Park (2002), Park et al (2003) 



Scaillet (2004) and Karunamuni and Alberts (2005). Recently, Dai and Sperlich (2010) suggested 



a simple correction based on utilizing a local bandwidth close to the boundary. Of main interest in 



this paper, though, will be a procedure close in spirit to that suggested in Marron and Wand (1994), 
namely transforming the variable of interest into another one whose density estimation should be 
free from boundary problems, and transform that estimate back into the initial scale. Although 
this transformation method seems very natural and has been around for a long time, a simple and 
efficient practical implementation of this methodology is yet to be developed, to the best of this 



author's knowledge. Marron and Wand (1994) 's method, for instance, has often been labeled 'very 



complicated' in the subsequent literature. 
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This paper actually investigates the transformation method in the case where the support of fx is 
compact, for instance the unit interval X = [0, 1]. As there is no loss of generality in considering the 
[0, l]-support case only, for any compact domain can be mapped onto [0, 1] by straightforward linear 
rescaling, this case will be the only focus here. When the density is supported on [0, 1], it is clear 
that the above mentioned bias issues are even more disturbing. Indeed, in that case both boundaries 
are likely to affect the performance of the density estimator over a major part of the support of fx- 
Of course, each boundary can be considered separately and some ad hoc surgery on the estimator 
can be performed close to them using methods like those cited in the previous paragraph. It seems, 
however, more natural to devise estimators which automatically take the constrained nature of X 



into account from the outset. In fact, such estimators were indeed suggested in Chen (1999), Jones 



and Henderson ( 2007afb ) and recently, in a wider framework, Botev et al (2010). 



The key idea is that, in order to properly handle the bounded support X of fx, a kernel estimator 



like (1.1 ) should use a kernel function K which is also supported on X, and this for all combinations 
of x, X,i and h. This would indeed prevent it from assigning positive weight outside X and the 



consequences thereof. Accordingly, Chen (1999) suggested to take K to be a suitably parameterized 



Jones and Henderson 



Beta density, defining the Beta kernel estimator fx,c 2 - On the other hand 
( 2007a|b ) rather suggested to use as kernels the conditional densities extracted from a bivariate 
Gaussian Copula, leading to their Gaussian Copula kernel estimator, fx,cc- The notations fx,c 9 
and fx,GC have been borrowed from 



Jones and Henderson 



( |2007a||b~| ) (except for the subscript 'X'). 



The asymptotic bias and variance of the Beta kernel estimator were given in Chen ( 1999 ), and further 



asymptotic properties were studied in Bouezmarni and Rolin (2003) and Zhang and Karunamuni 



(2010). As beta- kernel estimators are directly related to Bernstein polynomial smoothing, relevant 



information can also be found in Brown and Chen (1999), Bouezmarni and Rolin (2007) and Leblanc 



(2012). Jones and Henderson ( 2007a[b ) derived similar theoretical properties for their Gaussian 
Copula estimator and provided a detailed comparison between fx,c 2 and fx,GC both in theory 
and in practice. They stressed the overall very good practical performance of the Beta kernel 
estimator, but they also pointed out some shortcomings that fx.cc was aimed at correcting. In 
particular, fx,c 2 has, by construction, a propensity for estimating /(0) by a nonzero finite value in 
all cases. Hence it struggles to appropriately estimate densities such that /(0) = (same at x — 1) 
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or densities admitting a pole at one (or both) of its boundaries. Finally, Jones and Henderson 
( 2007a|b ) suggested a 'rule-of-thumb' reference bandwidth selector for easily choosing the smoothing 
parameter for both the Beta and the Gaussian Copula kernel estimators. Their comprehensive 
simulation studies showed that fx,c 2 an d fx,GC were roughly level on overall performance for a 



wide range of density shapes on [0, 1]. Botev et al (2010 )'s idea is somewhat different and has to be 
replaced within the frame of a kernel density estimation approach elegantly based on the properties 
of linear diffusion processes. This estimator, called the diffusion estimator and denoted fx,<as in 
this paper, was claimed to cure boundary bias problems as well. Developing and investigating a 
competitor for these three estimators is actually the main aim of this paper. 

As stated earlier, the estimator proposed here is based on transformation. Combining kernel density 



estimation and transformation is not a novel idea, though. This was first suggested in Devroye and 



Gyorfi (1985, Chapter 9) and Silverman (1986, Sections 2.9 and 2.10), and then studied in depth in 



Wand et al (1991), Park et al (1992), Ruppert and Cline (1994), Hossjer and Ruppert (1995) and 



Yang and Marron (1999). Later, it was taken up in Bolance et al (2008), Buch-larsen et al (2005) 



and Markovich (2005). There, the transformation did not aim at taking care of boundary bias, but 



rather dealing with other density features known to make the estimation difficult, such as skewness, 
smoothness inhomogeneity or heavy tails. Some exceptions, explicitly targeting boundary bias 



reduction, are the above mentioned Marron and Wand (1994), Karunamuni and Alberts (2006) and 



Koekemoer and Swanepoel (2008). In fact, all those papers attempt to obtain, via transformation 



of the initial data, a pseudo-sample coming from a distribution 'easy to estimate' in some sense, like 
the Uniform or the Normal distribution. Of course, the transformation able to produce the selected 
target distribution depends on the initial distribution of the original data, and should therefore be 
estimated from them. This need for estimating a tailored transformation unfortunately makes those 
procedures somewhat unwieldy and in a sense spoils the simplicity of the initial transformation idea. 

What is suggested here markedly departs from those previous contributions in that a fixed trans- 
formation is considered, without any attempt to produce any particular distribution. This can be 
motivated inversely to what has been said for the Beta and Gaussian Copula kernel estimators. 
Instead of working with unusual kernels whose supports are always the support X of X, here that is 
X which is transformed such that it always matches the support of usual kernels for any combination 
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of x, Xi and h, that is the whole K. Loosely speaking, here the transformation just aims at sending 
away the boundaries to ±00. This can potentially be achieved by any transformation T : [0, 1] — > R, 
continuous and increasing, such that lim^o T(x) = —00 and lim^i T(x) = +00. A natural choice 
seems to be the standard normal quantile function, denoted throughout the paper, hence the 
name of probit-transformation for the kernel estimators studied in this paper. 



It is acknowledged that this idea was mentioned in Jones and Henderson (2007b, Section 6.2), 
but those authors reported (p. 17) that "(...) standard kernel density estimator with rule- of -thumb 
bandwidth applied to pr obit-trans formed versions of the data (...) did not compete well (...)" against 



Chen (1999)'s and Jones and Henderson ( 2007a )'s estimators. This was also pointed out in other 



papers implementing the idea as-is. Yet, upon reflection, it seems clear that standard kernel estima- 
tion using a standard bandwidth is certainly not the right choice in this very situation, as discussed 
at length in the next section. In this paper, a totally viable probit-transformation kernel density es- 
timator is devised. The good point is that, the transformation being fixed, the suggested procedure 
is barely more complicated than any other conventional nonparametric density estimation. 

In the next section, the idea of probit transformation is introduced in more detail, the reasons why a 
straightforward application of it is not expected to give acceptable results are set out, and a way to 
enhance that naive estimator is suggested. Section [3] studies the theoretical asymptotic properties 
of the suggested estimators. Section [4] studies the always crucial point of bandwidth selection in 
this particular setting. Some practical rules are devised, of which some will prove best in practice in 
the simulation study of Section [5j where the different estimators discussed in the paper are put to 
the test. In Section [6j the density of a real data sample is successfully estimated via the proposed 
methodology. Finally, Section [7] concludes and suggests some way to continue this research. 



2 Probit-transformation kernel density estimation 
2.1 Probit-transformation 

As anticipated in the previous section, the suggested procedure for estimating a density supported 
on the unit interval is the following. Denote S = $ _1 (A) the variable of interest in the probit- 
transformed domain. It is clear that, if fx( x ) > almost everywhere on [0,1] (which will be 
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assumed throughout this paper), then S has unbounded support. From standard arguments, the 
density of S, say fs, is related to that of A by fs(s) = fx($(s))(f>(s) for all s G E, where <3> is 
the standard normal cumulative distribution function, and its density. Although obvious, it is 
stressed that S has no particular reason for being normally distributed. In fact, that will only be 
the case if the density of A on [0, 1] is one of the conditional densities extracted from a bivariate 
Gaussian copula, including as particular case X ~ ^[o,i] (then S ~ A/"(0, 1)). It is on this fact that 



Jones and Henderson 



(2007a b)'s 'Rule-of-Thumb' bandwidth selector for fx,GC is based. 

Naturally, one can also write fx{%) — fs(&~ 1 (%))/4 ) (.&~ 1 (. x )) f° r an x £ [0, 1] (to be understood as 
lim^-i-oo fs(s)/4>(s) at x = 0/1). It therefore follows that any estimator f s of fs instantly provides 



an estimator of fx, viz. 



f(T) (x) _ fs^M) , 2 V 

Ix {X) ~ 0($-i(x)) ' 1 ' 



where the superscript (T) refers to the idea of transformation. As S is unconstrained, fs is free 
from boundary problems, and mostly so should be fx - It is also clear that fxi x ) cannot allocate 
any positive probability outside the genuine support X = [0, 1] of fx- 

2.2 The naive estimator 

From the 'pseudo'-sample {Si, . . . , S n }, with Si = $ -1 (Aj), i — 1, . . . ,n, a first idea is to estimate 



fs with a standard kernel density estimator on the real line, that is an estimator like (1.1): 



nh 

i=i 



Back-transforming this to the A-domain through (2.1) directly yields 



fT{x) = _i_ ± K («-'('> -»"(*>) . (2 .3) 

Noting that = l/0($- 1 (x)), one can write the linear Taylor expansion $ 1 {Xi) ~ 

+ (Aj — x)/0($ _1 (x)) for Aj close to x, whence 

?(T) / \ 1 ( X - Xi 

fx \ x ) 



Thus, fx (x) essentially behaves like a conventional kernel estimator with a local bandwidth h*(x) 



i-e. a bandwidth depending on the estimation point x, as already stressed in Wand 
(1991). As lim^o/i 4>(^^ 1 (x)) = 0, this shows that fxP( x ) intrinsically uses smaller and 



et al 



smaller bandwidths when x is approaching or 1, and this is essentially how it attempts to cure 



the boundary bias problem. As such, this is similar to what Dai and Sperlich (2010) suggested 



Below, this estimator (2.3) will be called the naive probit-transformation kernel density estimator, 
as a simple example suffices to illustrate the problems that it faces. A sample of size n = 1000 



was generated from the C/jo,i]- distribution, and its density was estimated by (2.3). Here, f$ is the 



standard normal density, therefore in (2.2) K was chosen to be the Gaussian kernel and h was 



selected by Normal reference rule (Wand and Jones, 1995, Section 3.2.1), hence close to be optimal. 



Figure 2.1 shows the estimated densities in both the S'-domain (left) and the X-domain (right). 
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Figure 2.1: True (dotted) and estimated (plain) densities in the S-domain and in the X-domain for 
a simulated sample (n = 1000) from Z7[o,i]- The kernel was the standard normal density and the 
bandwidth selected by Normal reference rule was h = 0.27. Corresponding X-scaling is shown on 
top of the left panel for benchmark. 



In the left-panel, (2.2) does reasonably good at estimating the normal density in the S'-domain. 



When this f$ is transformed back to the X-domain, however, it produces the poor estimate in the 
right panel. In the interior, it decently estimates the Uniform density. However, towards and 1, 
where the attention mostly lies (fx * s aimed at curing problems at the boundaries), its shows a 
very erratic behavior with acute peaks/troughs and even explosion (at the right boundary). This 
latter feature is easily explained: the right tail of fs is noticeably thicker than that of the genuine 



Normal distribution, which obviously get their ratio, i.e. f x from (2.1), to tend to oo as x — > 1. 



The phenomenon at the left boundary is also typical of what may be observed with such an estimate. 
In fact, fs is slightly rougher than and slowly meanders around it. Although quite discreet 
and not much disturbing on the S'-scale, these fluctuations are magnified greatly by the back- 



transformation, especially those in the tails of fs (when takes on small values in (2.1 )). Moreover, 
in the A-domain, the frequency of the fluctuations gets higher and higher when approaching the 
boundaries. Of course, this is again the straight effect of the back-transformation: the tails of fs 
are bluntly shrunk back into [0, 1], resulting in the supremely wiggly behavior of f^p close to (the 
first trough is so acute that it is barely visible). The A-scaling on top of the left panel, which gives 
the standard normal quantile for the corresponding S-value, allows one to contemplate the amount 
of that shrinkage. In short, the back-transformation does not allow the homogeneous smoothness of 



the estimate on the S-scale to carry over to the A-scale. As is, this estimate (2.3) is not acceptable, 



indeed, like Jones and Henderson (2007b) deplored. 



2.3 An improved probit-transformation kernel density estimator based 
on local likelihood estimation 

The foregoing discussion, however, inspires some improvements for the probit-transformation esti- 
mator. First, focusing on the estimation of fs without the final estimate of fx in mind is pointless, 
as a good estimator for fs does not automatically becomes a good estimator for fx- More specifi- 
cally, it is clear that a global bandwidth on the S'-scale cannot produce an estimate of fx which is 
reasonably smooth all over it support. Rather, it seems sensible to work with a local bandwidth on 
the S'-scale. Note that this goes against the initial motivation for using transformations in kernel 



density estimation: Wand et al (1991) primarily suggested using a transformation to be allowed to 
conveniently use standard kernel estimation with global bandwidth on the transformed scale, when 
this was not effective on the initial one. Bandwidth issues are discussed in details in Section HI 

On another note, as exemplified above, it is particularly important that the tails of fs are estimated 
smoothly and, of course, as accurately as possible. Yet, standard kernel density estimators are 



known to do poorly in the tails ('spurious bumps'), so it is clear that (2.2) is not the best choice 
for estimating fs here. In fact, the smoothness of fs should be guided so that it is similar to the 
smoothness of <fi all over the real line, in particular in the tails. Here, "same smoothness as 0" has to 



S 



be understood not in terms of fs being infinitely many times different iable (this is known to depend 
on the kernel K only), but an estimate fs not spuriously fluctuating much around anywhere. An 
easy way to achieve this is to ask fg to locally behave like 0, and this naturally leads to estimating 



fs via local likelihood methods (Loader, 1996, Hjort and Jones, 1996, Park et al, 2002). 



Hjort and Jones (1996)'s local likelihood approximates locally the unknown fs by a density from a 



given parametric family, for which the Gaussian family is the obvious choice here. Now, given that 
(f)(s) = (27I") -1 / 2 exp(— s 2 ), this is also equivalent to locally approximating log fs by a constrained 



polynomial of degree 2. Locally fitting a polynomial to the log-density is actually Loader (1996)'s 



formulation of local likelihood estimation, and has some advantages over local parametric density 
estimation: the practical implementation is easier (closed-form expressions often exist for the esti- 
mators) and the asymptotic theory is more transparent (the bias does not depend on a 'best local 



parametric approximant' which is hard to define in practice). Therefore, only Loader (1996)'s local 



log-polynomial density estimation will be used here for fs, although it is stressed that the motiva- 
tion for that was originally to be found in local Gaussian estimation. In any case, final estimates 
using either formulation of local likelihood should be comparably similar. In particular, both are 
known to have favorable tail behavior and to be able to correctly make it up for the derivatives of 
the underlying density. Therefore, local likelihood estimation really meets the needs exposed above. 

The local log-polynomial likelihood method assumes that, around s, the log-density can be well 
approximated by a polynomial of some order, sayp: log/s(t) ~ aQ(s)+a 1 (s)(t—s) + . . .+a p (s)(t—s) p . 
The local coefficients are then found by solving a weighted maximum likelihood problem 



{a {s), . . . ,a. 



(s)) = arg max < \ K \ 

I 1=1 v 



-n / K 



Si - s 
h 

t - s 
~h~ 



(a + ai(5i - s) + . . . + a p (Si - s) p ) 
(a + a x {t - s) + . . . + ap(t - s) p ) dt J> . (2.4) 



The estimate of fs at s is then defined as fs P \ s ) — exp(ao(s)), which finally yields, via (2.1) 



x) 



/g ($ 1 (x))/0($ an estimate of fx at any x 6 (0, 1). Following the idea of local 



Gaussian estimation, only p = 1 or p = 2 will be considered in ( |2.4 ). This yields two improved 

~( r T'\ ) 

pr obit-trans formation kernel density estimators: f x (x) (based on local log-linear estimation of 
fs, p = 1) and f^ 2 \x) (based on local log-quadratic estimation of fs, p = 2). In the next 
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section, the asymptotic properties of these two estimators are derived, after those of the naive 
probit-transformation kernel density estimator (2.3). 



3 Asymptotic properties 



The kernel K, in (2.3) as well as in (2.4), will be assumed throughout to be the Gaussian kernel, 
as this seems to be the obvious choice in this framework. For that kernel, J u 2 K(u) du — 1 and 
J K 2 (u) du = t^t^. These quantities frequently arise in the properties of kernel density estimators, 
and direct use of those particular values will be made in the following results. 

3.1 The naive probit-transformation kernel density estimator 



From (2.1), it is clear that, for any x E (0, 1), 



W)) a,id v » = V-w) ii! 



The properties of the conventional estimator fg (2.2) are well known. If fg is twice continuously 
differentiable at s E R, then 

E (f s (s)) = f s (s) + \h 2 n{s) + o(h 2 ) and Var (j 8 ( 8 )) = ^= + o^uli)- 1 ) (3.2) 



if h — > and nh — > oo as n — > oo, conditions which consequently guarantee the consistency of f$. 

From fs(s) = fx($(s))4>(s) and using (fi'(s) = —s<p(s) and 4>"(s) = (s 2 — l)<f>(s), one obtains upon 
differentiation 

fs(s) = fx(m)<P 2 (s) - afxWaMs) (3.3) 
and f" 8 {a) = ^($( S ))0 3 ( S ) - 3s^($( S ))0 2 ( S ) + (s 2 - l)/ x ($( a ))^(a). (3.4) 

Given that and $ are infinitely many times differentiable on R, fg is seen to be twice continuously 



differentiable at s if and only if fx is twice continuously differentiable at $(s). Plugging this in (3.2) 
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and then in (3.1) yields, for any fixed x G (0, 1) at which fx is twice continuously differentiable: 



E(/?W) =fx(x) 

+ \h 2 {^(x)0 2 ($- 1 (x)) - 3^(x)$- 1 (x)0($- 1 (x)) + ({^(x)} 2 - l)fx(x)} + o(h 2 ), (3.5) 



and Var ( (x 



fx(x) 



+ o((nh)- 1 ) 



(3.6) 



2n/i0($" 1 (x)) v ^F 

Now, as $ _1 (x) and l/0($ _1 (x)) grow unboundedly as x approaches or 1, these results have to 
be modified when considering sequences of values x n — > 0/1. For a sequence such that x n /h m — > r] 
or (1 — x n )/h m — » r] as n — )■ oo for some m, ^ > 0, the asymptotic bias actually becomes 



bias (fx\x n ) \ ~ Cmft 2 log h 1 /x(^ r 



(3.7) 



for some constant C, and the asymptotic variance 



Var(/f } (x n 



(3- 



from the limit ($ 1 (x n )) 2 /(— 2 logx n ) — > 1 as x n — > (and similar for x n — >■ 1). It turns out that 



(3.6)/(3.8) is exactly the asymptotic variance of the Gaussian Copula kernel estimator (Jones and 



Henderson 



2007a 



bp. Like fx,GC an d /x,c 2 > fx su ff ers from that unusual boundary effect described 



in 



Jones and Henderson ( 2007a[b ) and Leblanc (2012), namely an increase in the variance order 
over a small region close to the boundary. The bias of f^i as seen from (3.5), is also of similar 



form to the asymptotic bias of fx,GC, except for the third term, viz. ({$ 1 (x)} 2 — l)f x (x), which 



is responsible for the boundary bias of higher order 0(h 2 log h 1 ) described by (3.7). Of course, the 



problem comes from the unbounded nature of $ ( 0/1, and it appears that only densities 

fx tending to very smoothly as x approaches the boundaries will be correctly estimated there by 
f x . Otherwise, the estimator is prone to exploding, like what was seen in Figure 



2.1 



(right). 



Expression (3.5), however, suggests an easy way to fix this, at least asymptotically. Indeed, instead 
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of (2.1), define the amended estimator 



-\x)) (1 + \h*({Q-\x)¥ - 1)) 



Then, it is clear that E [Q >{x) = Eif^' (x) /(I + ^({^(a:)} 2 - 1)), and given that 



l + W{{<b-\x)¥ -I) 



1 - -h\{^-\x)Y - 1) + o(h 2 ) 



(3.9) 



as h — > 0, it can easily be seen that the amendement exactly makes it up for the the third term in 



(3.5), leading to 



E (/x (T) (*)) = fx(x) + -h 2 {ftixWp-^x)) - 3^(x)$- 1 (x)0($- 1 (x))} + o(h 2 ). (3.10) 



The asymptotic variance is unaffected by this amendment, and remains equal to (3.6) (or its bound- 
ary version (3.8)). The expectation of f x ( x ) is now very similar to that of fx,Gc( x ): which is 



(Jones and Henderson 2007af b) 



E fjxM*)) = M*) + \ h " (2^(x)0 2 ($- 1 (x)) - 4^(x)$- 1 (x)0($- 1 (x))} + o(h 2 ). (3.11) 



Neither is uniformly smaller (in absolute value) than the other, however (3.10) seems to get an 



edge over (3.11). For instance, at x — 0.5 (there $ 1 (x) = 0), the bias expression of fx,GC j us t 



shows an extra factor 2. Sensible values for the bandwidth h in both expressions may be quite 
different, though, so that the comparison is not that straightforward. However, reproducing the 



analysis shown in Jones and Henderson (2007b, Section 4.4) allows one to find that the 'multiplier' 
of the optimal bandwidth for f x is ho = 2.5679, whereas it was 2.049 for fx,c 2 and 1.946 for 
fx,cc- Hence, in terms of asymptotic bias at x — 0.5, does slightly better than fx,GC but 

slightly worse than fx,c 2 - Further, the 'multiplier' for the optimal Mean Squared Error of f x is 



found to be (647r 2 ) 1//5 , that is, exactly what Jones and Henderson (2007b) found for the multiplier 



of the MSE of the Beta kernel estimator at x — 0.5 (and consequently smaller than that of fx,Gc)- 
Asymptotically, this amended probit-transformation estimator therefore appears to be a serious 
competitor for the other two. In practice, though, good performance for it is not guaranteed, in 
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particular close to the boundaries: there, {$ 1 (x)} 2 is large, and the linear approximation (3.9) is 
dubious. Thus, the amendment may not bring the expected bias correction. Also, the amendment 
implies that f x does not integrate to one any more, which calls for a renormalization such as 



fx (T \x) <- / 



*(T), 



1 f*(T) 



X 



x) dx. Such a renormalization is, however, commonplace in other 



nonparametric density estimation procedures and is not an issue per se. 

Finally, it is noted that the similarities in behavior of the probit-transformation estimator (at least 

estimator fx,GC, as evidenced by their 



its amended form) and 



Jones and Henderson 



almost identical asymptotic bias and variances, is not surprising. As said earlier, normal densities 
in the S'-scale transform into conditional densities from bivariate Gaussian copulas in the A-scale. 



Now, if the estimator (2.2) is thought of as the sum of Gaussian bumps kernel estimators are usually 



understood as, then back into the X-scale, those bumps become the 'Gaussian Copula' bumps the 

Figure 3). 



estimator fxcc is constructed from (Jones and Henderson 



2007b 



3.2 Improved probit-transformation kernel density estimators 

Now the properties of the 'improved' probit-transformation estimators of fx based on local log- 



polynomial estimation of fs are derived, starting with p — 1 in (2.4). Hjort and Jones (1996 



Section 5.2) derived a closed-form expression for that estimator, and concluded, in accordance with 



Loader (1996), that, if fs is twice continuously differentiate at s and fs(s) > 0, 



E [fl\s)) = fs{s) + \h 2 (r 8 {s) - + o(h 2 ) = f s (s)+ 1 -h'f s (s) (log f s (s))" + o(h") (3.12) 



and yK[ft\ 8 ))=J*W 



+ o((nhy 1 ). 



(3.13) 



Now, defining the corresponding estimator for fx'- f^^ix) = /^($~ 1 (a;))/0($ _1 (x)), and acting 
as in Section |3~T using (3.3) and (3.4) in (3.12) and ( 3.13[ ), it follows that, at any fixed x G (0, 1) 
such that fx(x) is twice continuously differentiable and fx{%) > 0, 



E ( fF\x)) = fx(x) + \h 2 | (%(x) - f^-\x)) - f' x {x)^\x)<P^-\x)) - + o 

(3.14) 

= f x {x) + \h 2 f x {x) {(\ogf x (x))" ( f> 2 (^ 1 (x)) - (\ogfx(x))'^- l (x) ( f>^- 1 (x)) - 1} +o 
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and Var ( / 



fx(x) 



+ o((nh)- 1 ). 



(3.15) 



2n/i0($- 1 (x)) v /7r 

The variance is the same as that of the naive version of the estimator (and indeed it is again 



as in (3.8) for sequences of x-values converging to one of the boundaries polynomially in h), but 



the bias expression is noticeably different. First, it now mainly depends on the derivatives of the 
log-density, which is a characteristic of local log-polynomial likelihood estimation. Second and 



most importantly, it is free from the boundary effect materialized by (3.7) (it is stressed that the 



function $ _1 (x)0($ _1 (x)) appearing in (3.14) is bounded). Consequently, (3.14) also holds for 
x — > 0/1, provided fx(%) > 0. Problems at x such that fx(%) = are well known for local log- 
likelihood methods, since the singularity of the log-density cannot be appropriately approximated 
by local polynomials. Yet, when such a situation arises at one of the boundaries, the problem is 
actually somewhat mitigated by the transformation, compared to raw local log-polynomial likelihood 
estimation of fx, as $ _1 (x)0($ _1 (x)) and 2 (<3> -1 (x)) — > as x — > 0/1. Finally, it is noted that, 
in general, f x (and f x below) will not integrate to one. This problem will, however, vanish 
asymptotically and in any case can be corrected for in practice using a simple renormalization as 
the one suggested in the previous subsection. 

The second improved probit-transformation kernel density estimator is obtained when locally fitting 



a second order polynomial for the log-density, that is, taking p = 2 in (2.4). Again, Hjort and Jones 



(1996) provided a closed-form expression for this estimator. They also pointed out that, as for local 



polynomial regression (Fan and Gijbels, 1996), locally fitting a second order polynomial decreases 



the order of the asymptotic bias from 0(h 2 ) to 0(h 4 ), provided that f$ has four continuously 



differentiable derivatives at s. Then, if fs(s) > 0, adapting the results of Loader (1996) yields 



E (fj\s) ) = fs(s) - ^f s (s) ((log f s (s))"" + 4(\og fs(s))'" (log f s (s))>) + o(h 4 



fs(s) - -h* f» 3 »(s) - 3 



and 



f_M 

fs(s) ' " fi(s) 



+ 2 



+ o(h 4 



(3.16) 



(3.4) can be differentiated further to obtain the first four derivatives of fg in terms of fx and 



), and one gets for any fixed x 6 (0,1) such that fx(x) is positive and four times continuously 
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differentiable: 



x 



fx(x) 



fM + 2 |M)^(,)) 



+ 2(9 fA fJ^ X) ~ 4 |H " 5/£(s)) ^\x)^-\x)) 



+ ((19 {$>-\x)} 2 - 4) r^x) - 15 {$-\x)y 

+ (^(x) - 5 {^(x)} 3 ) f'xixM*- 1 ^))) + o(h 4 ) 



fx(x) 



(3.17) 



and 



Var / 



27 



fx(x) 



+ o((nh) 



(3.18) 



16 2n/i0($" 1 (x)) v ^F 

Compared to that of f x (x), the variance has been inflated by a factor 27/16, which is the usual 
variability inflation factor from local linear to local quadratic regression estimators when a Gaussian 



kernel is used (Fan and Gijbels, 1996, Section 3.3.1). Similarly to (3.8), for a sequence of values 
x n tending to or 1 such that x n /h m — > i] or (1 — x n )/h m — > rj for some m,r] > 0, the asymptotic 



variance is Var ( / 



27/16 f x (x n )/(nh 1+2m \/2r) 2 ). On the other hand, the bias is now of 



order 0(h 4 ), as expected, and its expression involves the first four derivatives of fx- The important 



point, though, is that, like (3.14), this holds also for x — > 0/1 (provided fx(x) > 0) with no extra 
boundary effect. Besides, this leading bias term even tends to as 1 tends to or 1, since the 
functions {$ _1 (x)} r s ($ _1 (x)) tend to 0, for r,s integers, r > 0, s > 0. This was not the case for 



fx (x), due to the third term in the bracket in (3.14). Therefore, at the boundary, the bias of 
fx (x) is actually o(h ), and this time the boundary effect appears advantageous. This is obviously 
true if fx(x) does not tend to too quickly and its derivatives do not grow too quickly to 00 when 



approaching the boundaries (i.e. if no terms in (3.17) grows unboundedly there). 



A natural question is what happens when fitting a local log-quadratic estimator if fx is not four 
times continuously differentiable. Remarkably, one still achieves some bias reduction, compared to 
fx (x), under the same smoothness assumption of two continuous derivatives. This can first be 

(...) the estimate will perform well; 



understood through Loader (1996)'s own words (p. 1612) 



existence or otherwise of density derivatives is incidental 1 . This fact will carry over to the estimate 
of fx, and in fact, upon inspection of Loader] (1996) 's proofs and the previous developments, it can 
be seen that in this case, provided that f x (x) > 0, E (fP\x)) = f x (x) + o{h 2 ). 
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It must be mentioned here that ad hoc bias reduction (from order 0(h 2 ) to 0(h 4 )) techniques based 
on multiplicative adjustment, and the resulting reduction of Mean Squared Error rate from the usual 



0(n -4 / 5 ) to 0(n~ 8 / 9 ), were attempted for Chen (1999)'s Beta kernel estimators in Hirukawa (2010), 
but that author observed that his simulation studies (p. 480) "do not necessarily support superiority 
of the modified beta kernel [i.e. the bias corrected version] over the beta kerneV . Contrariwise, 
combining probit transformation and local log-quadratic estimation achieves that asymptotic bias 
and MSE reduction in an automatic way (provided that fx is smooth enough, which was of course 



also assumed in Hirukawa (2010)), and is effective in practice, as evidenced in Section p 



3.3 On the interest of the transformation 



Local likelihood estimation of fs emerged naturally in Section |2.3| as it mostly addressed the draw- 
backs of the naive implementation of the probit-transformation idea. This said, transforming the 
data in the first place mainly aimed at dealing with boundary bias, and local likelihood estimators 
on their own have been advertised as having advantageous bias behaviors compared to conven- 



tional kernel estimators anyway (Loader, 1996, Hjort and Jones, 1996). Hence, the necessity of the 



transformation in this context can be questioned. There are, however, some reasons for persisting 
with the transformation idea even if the main estimation step is articulated around local likelihood 
methods. 

First, one cannot but notice that local likelihood estimation is far from imposing itself as the 
panacea for boundary bias correction, as evidenced by the large amount of other treatments still 
being suggested in the literature. Second, local likelihood estimation has been somewhat questioned 



m 



Hall and Tao (2002), who showed among other things that, for densities in C2OR), the MISE of 



the conventional kernel density estimator is at most that of the analog local log-linear estimator, 
and can be dramatically lower. Conclusions of this type are not that obvious when comparing 



integrated squared versions of (3.5) or even (3.9) to (3.14). Therefore, it seems heuristically that 



passing local likelihood methods through a transformation is somewhat beneficial to them. Finally, 
the transformation idea on its own is known to have numerous other advantages, which motivated 



its introduction in the first place (Devroye and Gyorfi, 1985, Wand et al, 1991). The beneficial 



effect of the probit-transformation clearly appears through Section [5}s simulation study. 
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4 Bandwidth selection 



4.1 Fixed bandwidth 



As in any nonparametric method, the smoothing parameter, in this case the bandwidth h for 



estimating fg via (2.4), plays a crucial role in how well the estimator performs. A theoretically 



optimal value of h can easily be derived by balancing the integrated asymptotic squared bias and 
variance of the considered estimator (( |3.14 )-( 3.15 ) or (3.17)-(3.18)). A plug-in approach, aiming 
at estimating the unknown quantities in that expression, seems however unsuitable here, given the 



complicated nature of the bias expressions, especially (3.17) 



Therefore, an approach based on least-squares cross-validation ideas seems preferable: for the esti- 
mators (q = 1, 2), this is selecting the bandwidth minimizing an estimated version of 



WISE(fc) 



fx Q \x) - f x (x) ) w(x) dx 



(4.1) 



for some non-negative weight function w. In their reference rule, Jones and Henderson ( 2007a[b ) 
considered the weight function w(x) = 0($ _1 (x)) when approximating WISE. Within the probit 
transformation setting, however, it is noticeable that this corresponds to an unweighted least-squares 
criterion in the S'-domain. Indeed, the simple change-of- variable $ _1 (x) = s yields 



fx q \^)-fx{x)) 4>{*- l (x)) dx 



1 ( f^\^(x))-f s ^~\x)) 



(9), 



0($ _1 O)) 

)-fs(s)) 2 ds. 



$ {x))dx 



With this weight function, the selected bandwidth h is thus the usual cross-validated bandwidth 
aiming at being optimal for estimating fg. However, as stressed in Section [273] , a bandwidth good at 
estimating fg may not be so when the real goal is to estimate fx- In fact, w(x) = 0($ _1 (x)) down- 
weighs the contribution from the boundary regions to WISE, and this is again in some contradiction 
with the original motivation of constructing estimators fx^ performing well at the boundaries. A 



natural idea may then appear to take w = 1 in (4.1 ), which amounts to choosing the bandwidth as 



the one value minimizing (fg q \ s ) ~ fs( s )j /0( s ) ds in the S'-domain. It was seen empirically 
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that this choice leads to severe oversmoothing, though, as definitely too much emphasis is put on the 
tails of fs then, not to mention that this is not even clear whether the previous integral converges. 

A good compromise between w(x) = and w(x) = 1 seems to be w(x) = (t>{^~ l {x))/ fx{x). 

Intuitively, this down-weighs the contributions of the boundary areas to WISE only if fx is large 
there. In a sense, this put the focus on the relative error rather than on the absolute one. Bandwidth 
selection via weighted cross-validation was studied in practice in the simulations (see Section [5]), 
and some success was also achieved by using the weight function w(x) = 0(<& _1 (a;)) / \J fx(x). In the 
S'-domain, this corresponds to weights oo(s) = 4>(s)/fs(s) and oo(s) = \/(j)(s)/ fs(s), respectively, 
which is again straightforward to see through the change of variable = s. Of course, 

in practice, fs(s) needs to be estimated in either u(s). For that purpose one can just use the 



conventional estimator fs(s) with bandwidth selected via direct plug-in (Sheather and Jones, 1991). 

50, essentially, the bandwidth h is selected via classic least-squares cross-validation for estimating 
fs, except that a weight function is added in the least-squares criterion. Specifically, 

{/"+°° r~"i 2 9 n ~ 1 

where Cj denotes the estimated version of the considered weight function u, and, as usual in cross- 
validation methods, fsf-i) is the 'leave-one-out' version of fg computed on all the observations but 

51. Similar ideas of weighted cross-validation have been used in many other contexts before, and 
cross-validation routines from any software can be used to easily perform this optimization task. 

It is noted that bandwidth selection via likelihood cross-validation, i.e. based on minimizing a 
Kullback-Leibler divergence rather than a least-squares one, was also initially considered. The 
estimators f s g ^ being local likelihood estimators, this seems a very natural idea in this setting. 
The strong links between local likelihood methods and Kullback-Leibler divergence were already 



underlined in Hjort and Jones (1996, Section 2) and Lee and Park (2006). However, the method 



did not perform competitively against the others, and the results are not shown for sake of brevity. 
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4.2 /c-Nearest-Neighbor bandwidth 



The above discussion focus on a fixed bandwidth h, constant over K when estimating f$. However, 



as discussed in Section 2.3 a local bandwidth seems more appropriate on the S'-scale for producing 



homogeneous smoothness for f^ q ^ over X in the A-domain. An easy way to achieve local bandwidth 
estimation, and the adaptive behavior thereof, is to use fc-Nearest-Neighbor (fc-NN) methods. For 
conventional kernel density estimation, though, fc-NN methods are known to produce extremely 
rough density estimates with fat tails, and are consequently barely used. That said, it appears 
that those issues mostly disappear when they are used in conjunction with local likelihood density 



estimation, as already pointed out in Simonoff (1996, Section 3.4). Also, Davison and Hall (2007) 



suggested a procedure which is essentially close to fc-NN ideas to remove the bumps in the tails of 
the kernel density estimator. Hence using a fc-NN bandwidth seems totally appropriate here. 

Specifically, the local bandwidth, say h(s), is set to d k (s) — \s — S7fc) )S |, where S^), s is the kth 
closest observation to s out of the pseudo-sample {Si, . . . , S n }. It is now a = k/n, the fraction of 
sample observations that will actively enter the estimation of fs{s) at any s, which will play the 
role of the smoothing parameter. As <3> _1 is monotonic increasing, this is actually also the fraction 
of observations from the initial sample {Xx, . . . , A n } that will actively enter the estimation of fx{x) 
at any x EX. The smoothing parameter is thus selected as if it was directly in the A-domain, hence 
the appropriateness of a fc-NN rule here. In practice, a can be chosen exactly as in the previous 



subsection, the minimization in (4.2) now being performed with respect to a instead of to h. 



Another insight into why fc-NN methods can be useful for the probit-transformation estimators is 
got through the following. For h a fixed bandwidth, the asymptotic variance of f[ j (s), q = 1,2 is 



C q x fs(s)/(2nhy^ir), with C\ = 1 and C*2 = 27/16, as given in (3.13) and (3.16). On the other 



hand, dk(s) is the kth order statistics of the 'sample' of distances between the observations and s, 



and it has been shown (Mack and Rosenblatt, 1979) that this is such that E (l/dk(s)) ~ 2fs(s)/a. 



Hence it can be seen that, under our assumptions, the variance of fg with such a fc-NN bandwidth 
is asymptotically Var [fs (s)) — C q fg(s)/(na^), provided that na — > oo and a — > 0. Back in 



the A-domain through (3.1), one gets 



Var(/f 9) (x)) ~C q x 



na\/7r 



(4.3) 
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This variance is now free from any factor l/0($ l (x)) which can make (3.15) and (3.18) arbitrarily 



large when x is close to the boundaries. Consequently, (4.3) should hold true as such also for 



x — > 0/1, free from boundary effects formalized by (3.8) or similar. 



The foregoing argument is only heuristic, though, as the asymptotic variance formulas (3.13) and 



(3.16) are valid only under the usual 'nonparametric assumption' that h — > ('small bandwidth 
asymptotics'). However, local likelihood methods may also act as semiparametric estimators, when 



h — > oo (see the discussion in Hjort and Jones (1996)), giving rise to another asymptotic behavior 



large bandwidth asymptotics'), see Eguchi and Copas (1998), Park et al (2006) for details. In 



practice, the limit between small and large bandwidth asymptotics is evidently quite fuzzy, and it 
is not clear which asymptotic approximation, if any, provides the best picture in a given situation. 
The question is even more relevant when the bandwidth is essentially random, as in the case of 
fc-NN estimation. This is another reason for selecting the smoothing parameter, either h or a, 
via cross-validation, as this is not directly based on asymptotics (unlike the plug-in bandwidth 
selection rules). In any case, the above discussion leading to ( 4.3[ ) was only meant to further 
motivate the usage of fc-NN methods for the probit-transformation estimators. Estimators using a 
fc-NN bandwidth will indeed prove superior in most of the situations in the simulations below. 



5 Simulation study 

The practical performance of the procedures exposed in the previous sections is now compared 
to the Beta kernel estimator fx,c 2 , the Gaussian Copula kernel estimator fx,GC an d the diffusion 
estimator /x,diff- For reference, the conventional kernel density estimator (without any correction) 



and its corrected version given by Dai and Sperlich (2010)'s 'simple boundary correction' have also 



been incorporated to the study. No other boundary correction methods have been considered, as Dai 



and Sperlich (2010)'s simulations evidenced that their simple correction was essentially matching 



the performance of the other, more elaborated methods. 

A thousand independent samples, of size n = 50 and n = 500, were simulated from the 16 [0, 1]- 
supported densities considered in Jones and Henderson ( 2007a[b ), see those papers for their exact 
description. These densities show a wide range of shapes and features (multimodality, smoothness 
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inhomogeneity, unboundedness at boundaries, cusp) and therefore allows a comprehensive compar- 
ison of the estimators under test. Those were used to estimate the different densities from the 
generated samples on an equally spaced grid of 999 points (from 0.001 to 0.999). For a given 
estimate, say /, the estimation accuracy was quantified by the Integrated Squared Error (ISE) 
approximated on the grid of points, i.e. ISE(/) = X^li f/(Vl000) — • The Mean 

Integrated Squared Error (MISE) of the different estimators was finally approximated by aver- 
aging the obtained ISE over the 1,000 Monte-Carlo replications. Results can be found in Table 



reftab:sim50 for n = 50 and Table 5.2 for n = 500. More details on practical issues are given below. 



For the Beta and the Gaussian Copula kernel estimators, fx,c 2 and /x,gc> the bandwidth was 
selected according to Jones and Henderson ( 2007a|b ) 's 'rule-of-thumb' methods. It must be noted 
straightaway that this may or may not be suitable. For this type of reference rules, the appropriate- 
ness of the selected bandwidth heavily depends on the adequacy of the assumed parametric shape 
for the true density. Naturally, for fx,c 2 , a Beta density is used as reference, whereas for fx,GC the 
reference density is a conditional density from a bivariate Gaussian Copula. In these simulations, 
the selected bandwidth will therefore be ideal for densities of these types (Densities 1, 3 and 5 for 
fx,c 2 , Density 15 for fx,cc), but may be inappropriate in other situations (typically for the multi- 



modal Densities 2, 13 and 14). There are apparently no other options in practice, though. Jones 



and Henderson 



(2007a) themselves compared the performances of fx,o 2 and fx,GC at estimating 



Densities 2, 13 and 14 using that 'rule-of-thumb', although none would give any reliable estimation 



of those three densities. Hirukawa (2010) does not act differently in his simulation studies, and this 



is therefore also what is done here. 



Botev et al 



(2010)'s diffusion kernel density estimator fx,<as is 



advertised to come with an automatic bandwidth selection rule, which was obviously used here. 



For the conventional kernel estimator fx and its Dai and Sperlich (2010 )'s corrected version (called 
fxcorr below), the bandwidth was selected via 



Sheather and Jones 



(1991)'s plug-in method on 



the raw data set. Dai and Sperlich (2010 )'s method is just to take as local bandwidth at x the 



minimum between the global, previously selected bandwidth and the distance from x to the closest 
boundary, and to apply a renormalization. For the naive probit-transformation kernel estimators, 



the bandwidth in the 5"-domain was again chosen according to Sheather and Jones (1991)'s rule 
on the pseudo-sample {Si, . . . , S n }. For the improved probit-transformation estimators, all the 
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possibilities described in Sections [4] were studied, which yielded 12 more estimators: estimators fx"^ 
based on local log-linear estimation of fs with fixed bandwidth h selected via 1) unweighted Least- 
Squares Cross- Validation (LSCV), 2) weighted LSCV with weight function tu(s) = a/ fs(s)/(J)(s) ( 
WLSCV1) and 3) weighted LSCV with weight function u(s) = fs(s)/<p(s) (WSLCV2); estimators 
fx based on local log-quadratic estimation of fs with fixed bandwidth h selected via the same 3 
methods (LSCV, WLSCV1, WLSCV2); and estimators /^ T1) and f ( P ] with fc-NN bandwidth with 
the value of a again selected via LSCV, WLSCV1, WLSCV2. In the above weight functions u(s), 



fs(s) was estimated by (2.2) with Sheather and Jones (1991)'s bandwidth. 



Finally, for assessing the usefulness of the transformation and backing the observations made in 



Section 3.3, the results for the local log-linear and log-quadratic density estimators computed di- 
rectly on the initial dataset {X±, . . . ,X n } are also shown. The bandwidth (h or a) was selected via 
least-squares cross-validation (LSCV). Those estimators are denoted f^ and fx below. 

All the computations were run in the R software, using the built-in functions for local log-polynomial 
density estimation available in the locf it package. The practical implementation of the discussed 
methods is therefore straightforward and available to anybody. 

5.1 Results - n = 50 



As seen from Table |5.1[ there is no uniformly best estimator dramatically outperforming all the 
others for all densities. In fact, 16 different estimators (out of the 23 which are considered) can claim 
to be best at estimating at least one of the 16 densities. The diffusion kernel estimator fx,<as an d 
the improved probit-transformation estimator fx" with a-WLSCV2 bandwidth lead this particular 

~(T1) 

ranking with 4 first positions each, followed by the conventional estimator fx and f x with a- 
WLSCV2 bandwidth, 3 first positions each. On the Mean Integrated Square Error averaged over 



the 16 densities (column 'Total' of Table 5.1), the three improved probit-transformation estimators 



f x and bandwidth of type fc-NN lead the way. The exact weight function used in (4.2 ) is not much 



important, with a slight preference for the WSLCV1 weight, though. Fourth comes the amended 
naive probit-transformation estimator fx , and fifth fx" 1 '* with a-WLSCVl bandwidth. 

If a score of 1 point is given to the estimator ranked first (on MISE) for a particular density, a score 
of 2 points is given to the estimator ranked second, etc., a 'robust' overall ranking may be obtained. 
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~i r p c y\ 

Again, the three improved probit-transformation estimators f x with fc-NN bandwidth come first 
(a-WLSCVl, 84 pts; a-WLSCV2, 91 pts; a-LSCV, 115 pts). The Gaussian Copula kernel estimator 
is fourth (136 pts) and /j^ with a-WLSCV2 bandwidth fifth (139 pts). In this particular ranking, 
/x.diff is 7th (147 pts) and fx,c 2 only 10th (156 pts). 

Of course, an obvious reason why fx,c 2 an d /x,gc do not perform better on the average is their 
terrible MISE for the multimodal densities 2, 13 and 14, as a result of using a totally inappropriate 
bandwidth. As stated in the previous section, however, there is currently no other reliable bandwidth 



selection rule available in the literature for those two estimators. For instance, Chen (1999) himself 
sometimes opted for subjective bandwidth selection in his simulations since the devised cross- 
validation-based procedure was not trustworthy. Bandwidth problems for the Beta kernel estimator 



are also discussed in Zhang and Karunamuni (2010). Besides, the bandwidth is only part of the 



picture. For Density 1, which is the Beta(4, 4) density, the 'rule-of-thumb' bandwidth for fx,c 2 
should be optimal, but fx,c 2 ^ s s ^ beaten by the best probit-transformation estimators fx^ 
(bandwidth a-WLSCVl and a-WLSCV2). The same observation holds for fx,GC an d its optimal 
bandwidth for Density 15: it is also beaten there by fx (bandwidth a-WLSCVl and a-WLSCV2). 
It is also noticeable that fx,c 2 really struggles at estimating the unbounded Densities 5 and 10 (this 
is also the case for fx,dis)- The Gaussian Copula estimator fx,GC can reasonably cope with that 
feature, but is again unable to compete with the improved probit-transformation estimators on it. 

Some other, minor but still noteworthy points are the following. In these small samples, the simple 



boundary bias correction suggested in Dai and Sperlich (2010) is not effective. It manages to reduce 



the MISE of the conventional estimator for the unbounded densities 5 and 10, for obvious reasons, 
but for the other cases both estimators are roughly level (the conventional estimator even appears 
slightly better). On the other hand, the amendment made to the naive probit-transformation 
estimator seems to effectively improve its MISE. Only for the unbounded densities is the MISE 
of the initial version smaller than that of the amended version, which is easily understood: the 
amendment tends to prevent the estimate from growing unboundedly at the boundaries. 

Comparing the improved probit-transformation estimators between them, it is clear that using a 
bandwidth of type fc-NN is a must, with this small sample size (n = 50). The fixed bandwidth 

~(T2) ~(T1) 

versions do not work well, especially for the estimators f x . For the estimators f x , the situation 
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is less clear-cut, and for some density shapes a fixed bandwidth may sometimes improve on a fc-NN 
bandwidth. For other shapes, however, it may yield very poor results (densities 12 and 16). It 
is also noticeable that, when a fc-NN bandwidth is used, local log-quadratic estimation of fs is 
preferable over local log-linear estimation, but it is the contrary when a fixed bandwidth is used. 
Finally, it is clear that the local-likelihood estimators directly targeting fx without going through 
the transformation (estimators fx) cannot compete, as anticipated in Section 



3.3 



5.2 Results - n = 500 



For larger samples (n = 500) (see Table 5.2), the same conclusions essentially hold true, although 
there are some noteworthy changes. The diffusion estimator fx,<m is now t ne sole leader in the 'first 
positions' ranking, with 5 first positions. It is followed by the Beta kernel estimator fx,c 2 (4 first 
positions), the conventional estimator fx and f^ x ^ with a-WLSCV2 bandwidth (3 first positions). 
On the average MISE, this is now f^ 2 ^ with a-LSCV bandwidth which is the best, just before the 
same f x with a-WLSCVl bandwidth and f x with a-WLSCVl bandwidth. Very closely follow 
the naive probit-transformation estimator f x and the bias-corrected conventional estimator fx, CO w 
In the ranking induced by the ranks, f x - a-LSCV is again first (106 pts), ahead of f x - a- 
WLSCV2 (114 pts), f x > - a- WLSCV1 (122 points) and the amended naive probit-transformation 
estimator f x T ^ (136 points). The diffusion estimator fx,<as (137 pts) completes the Top 5. 

~(T2) 

The main point is that f x - a-WLSCV2 has now disappeared from the top positions. In fact, 
the WLSCV2 criterion puts a special emphasis on the tails of fs when estimating it, and this is 
necessary to achieve smooth estimates of those tails in small samples, when very few observations 
fall in the tail regions. Contrariwise, in large samples, enough observations are found even in the 
tails of the density and the 'moderate' tail emphasis driven by the criterion WLSCV1, or even no 
particular emphasis at all (LSCV), works well enough. The other observations made in the previous 
subsection for n = 50 remain mostly true for sample of size n = 500. 

All in all, without distinction on sample size, the improved probit-transformation estimator f x 
with fc-NN bandwidth selected by weighted cross-validation (WLSCV1) appears to be the best, 
overall. It seems therefore reasonable to recommend it as preferred all-around estimator for non- 
parametrically estimating a density on the unit interval. In small samples, a particular emphasis on 
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the tails of f$, through the weight function WLSCV2, may be appropriate. In fact, the most serious 
competitor of these estimators seems to be the diffusion kernel estimator fx,<m- With its automatic 
bandwidth selector, it is often doing well, except for unbounded and multimodal densities. The 
estimators fx" with fc-NN bandwidth, however, never do very badly, and often do very well. 

Actually, they lead many of the rankings one could think of to assess the relative performance of 
the 23 estimators that were considered. Indeed, the foregoing discussion only considered the Mean 
Integrated Squared Error, but the same conclusions were drawn by looking only at the boundary 
behavior of the estimators. For instance, for the same simulations the Mean Squared Error of each 



estimator at x = 0.01 and x = 0.99 was calculated (this analysis again mimics that in Jones and 



Henderson (2007b)), and the improved probit-transformation estimators were again found to be 
best on the average, to a similar extent. These results are not shown for sake of brevity, but are 
available upon request. It is also stressed that the proposed cross-validation bandwidth selection 
rules showed very consistent behaviors over the Monte-Carlo replications. In other situations, it is 
well-known that similar LSCV ideas may produce highly variable bandwidths. 

6 Real data example 

In this section the improved probit-transformation method is used to estimate a probability density 
from a real data set. The data give the proportion X of white student enrollment in n = 56 school 
districts in Nassau County (Long Island, New York), for the 1992-1993 school year. Estimating 
the density of X has been of interest to assess the common perception in the US in the 90 's that 
public schools were still strongly segregated by race, despite political effort to integrate them. Of 
course, X being a proportion, its density is supported on X = [0, 1] and estimating it is somewhat 
problematic for the reasons explained in Section [TJ This data set was, among others, considered in 



Simonoff (1996, Sections 3.2 and following) for illustrating boundary bias problems. 



Figure 6.1 (left panel) shows the Beta kernel estimator fx,c 2 (with rule-of-thumb bandwidth h = 
0.320) and the Gaussian Copula kernel estimator fx,GC (with rule-of-thumb bandwidth h = 0.408) 
superimposed on a sample histogram. The Gaussian Copula kernel estimator shows a peak at 
the boundary, which is actually meaningful here. In fact, two schools showed a white students 
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enrollment extremely close to 0, and the estimator attempts to put a positive probability mass 
atom at 0, hence the spike. The Beta kernel estimator totally misses this. On the other side, simple 
visual inspection (see the bottom of the histogram) reveals that there are no observations very close 
to 1, so the density should reach a maximum at around 0.9-0.95 and then tumble down to when 
approaching 1. Again, this is mostly what fx,cc shows, but the Beta kernel estimator fails to catch 
this as it remains roughly constant for x- values between 0.85 and 1. 

Note that both estimates appear noticeably oversmoothed, which is not surprising given the clear 



bimodal nature of the data and the way their bandwidth was selected. Figure 6.1 (right panel) shows 
the same estimators but with initial bandwidths divided by two. Such small bandwidths (and even 
smaller, actually) are necessary to visually recover the curvature of the density for x between 0.5 
and 0.9, as suggested by the histogram. The above mentioned two features of the density (spike at 
0, drop at 1) are now even more pronounced on the Gaussian Copula kernel estimator whereas the 
Beta kernel estimator still struggles to evidence them. Both appear quite undersmoothed this time, 
which makes them visually unpleasant with spurious bumps arising for x-values between and 0.4. 




Figure 6.1: Beta kernel (plain) and Gaussian Copula (dashed) estimates with |Jones and Hender- 
son ( 2007afb ) 's 'rule-of-thumb' bandwidths (left panel) and half those (right panel), for the white 
students enrollment data. These estimates are overlaid to a sample histogram, and the actual 
observations are displayed as a 'rug' along the the bottom of the graph. 



Finally the improved probit-transformation kernel density estimator, based on local log-quadratic 
estimation of fs (i.e. f x ) and fc-NN bandwidth, was used to estimate fx- The bandwidth was 
selected via weighted least-squares cross-validation with weight u(s) = (f>(s)/fs(s) (WLSCV2), 
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given the small sample size (n = 56). The appearance of the Weighted Least Squares Cross- 



Validation criterion, as in (4.2), is shown as a function of a in Figure 6.3 The selected value was 



a = 0.885, resulting in the estimate in Figure 6J2 It has a smooth and pleasant appearance, without 
oversmoothing, and clearly evidences the main features of the underlying density. This nice fit was 
obtained in a totally automated manner. With the WLSCVl weight function oj(s) = \J '4>(s) / J 's(s) , 
the selected value of a was 0.92, yielding essentially the same estimate which is therefore not 
shown. As a comparison, the diffusion kernel estimator (computed with its automatic bandwidth) 
is also shown in the figure, in addition to the previous two Beta and Gaussian Copula kernel 
estimators. The estimate fx,dm fails to evidence the two features of the density. In fact, the 
diffusion kernel estimator has a propensity for providing estimates whose derivatives are zero at 



the boundaries, as shown in Botev et al (2010)'s equations (3) and (4) and the related comments. 



This is clear from Figure 6.2 The obtained estimate can also be compared to Figures 3.9 and 



3.17 in Simonoff (1996), which show, respectively, the conventional kernel density estimate and the 
local log-quadratic estimate directly computed on the raw data, i.e. without transformation. The 
improvement brought by transforming the data is visually obvious. 




Figure 6.2: Probit-transformation kernel density estimator for the white students enrollment data. 
Local log-quadratic density estimation was used with a fc-NN bandwidth selected by weighted least- 
squares cross-validation (a = 0.885). Previous Beta and Gaussian Copula kernel estimators with 
'rule-of-thumb' bandwidths, and the 'diffusion' estimator are also shown for easy comparison. 
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Figure 6.3: Appearance of the Weighted Least Squares Cross- Validation criterion WLSCV2, as a 
function of a, for the the white students enrollment data. The selected value is a = 0.885. 

7 Concluding remarks 

In this paper, a kernel estimator for unknown densities supported on the unit interval X = [0, 1] 
has been devised. The suggested procedure is actually an improved version of the basic idea of 
the transformation kernel density estimator, which has been around for a long time but which had 
never been implemented simply and efficiently As such, the proposed methodology fills a gap in the 
literature. A fixed, probit transformation is used in all situations, and it is argued that an approach 
based on local likelihood methods mostly cures the shortcomings previous 'naive' transformation- 
based estimators suffer from. A range of such 'improved' probit-transformation kernel density 
estimators, varying in the exact nature of the local likelihood method which is used and in the way 
the bandwidth is selected, has been studied. Asymptotic expressions for the bias and the variance 
have been derived for the cases of interest. In particular, if the density to estimate is smooth enough 
(four continuous derivatives), it has been shown that the bias of some versions of the estimator is 
of order 0(/i 4 ) without the need for ad hoc bias correction methods. This is, of course, a property 
taken over from the local likelihood density estimation methods which enjoy it at the source. In 
favorable situations, the bias there may even be o(/i 4 ), under the same smoothness assumption. A 
comprehensive simulation study (23 estimators were studied on 16 different densities on [0,1]) has 
confirmed the very good behavior of the suggested probit-transformation estimators. In particular, 
an estimator based in local log-quadratic estimation of the density in the transformed domain, 
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coupled with a bandwidth of type fc-nearest neighbor selected via weighted cross-validation, has 
positioned itself as the best choice. Remarkably, the cross-validation criterion provides, here, a 
reliable way of selecting the smoothing parameter, unlike what has sometimes been observed in 
other settings. On the average, the improved probit-transformation estimators have been seen to 



outperform their main competitors, namely the Beta kernel estimator (Chen, 1999), the Gaussian 



Copula estimator (Jones and Henderson, 2007afb ) and the diffusion estimator (Botev et al , 2010). 

Some theoretical refinement may be needed to fully contemplate the potential of the suggested 
method. In particular, local likelihood estimators are known to admit different asymptotics whether 
the smoothing parameter is forced or not to tend to as n — > oo. The areas of application of the 
two types of results is evidently quite difficult to delimit in practice, and this is even more the case 
when the bandwidth is variable and random, like a bandwidth of type /c-nearest neighbor. As this 
type of smoothing parameter has been seen to do better than the fixed bandwidth versions, it would 
be of great interest to derive an appropriate unified asymptotic theory for that case. In this work, 
only heuristic arguments have been proposed to motivate using fc-NN bandwidths. 

On a more applied perspective, one can wonder how to use the present estimator of densities on [0, 1] 



for estimating densities supported on [0, +oo). Jones and Henderson (2007b) suggested a procedure 
based on a second transformation: if Y is has a density supported on IR + , then X = Y/(Y + 1) 
(for instance) is supported on [0, 1], and the density of X can be estimated with one of the probit- 
transformation estimator discussed in this paper. This estimated density of X can then be back- 
transformed to 1R + to provide an estimate of the density of Y. This procedure is indeed totally in 
the same spirit as what has been discussed here, and it would be interesting to study how well it 
works for curing boundary bias of kernel estimation of densities of positive random variables. 
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